Informedia E-Lamp@TRECVID 2012: Multimedia Event Detection and Recounting (MED and MER)
Abstract
We report on our system used in the TRECVID 2012 Multimedia Event Detection (MED) and Multimedia Event Recounting (MER) tasks. The MED system consists of three main steps: feature extraction, detector training, and fusion. In the feature extraction step, we extract a wide range of low-level, high-level, and text features. These features are then represented in three different ways: a spatial bag-of-words with standard tiling, a spatial bag-of-words with feature- and event-specific tiling, and the Gaussian Mixture Model (GMM) supervector. For detector training and fusion, two classifiers and three fusion methods are employed. Both the official evaluation results and our internal evaluations show that the system performs well. Our MER system takes a subset of the features and detection results from the MED system, from which the recounting is then generated.

1. MED System

1.1 Features

To cover all aspects of a video, we extracted a wide variety of low-level and high-level features; Table 1 summarizes them. Most of these features, such as SIFT, STIP, and MFCC, are widely used in the community, and we extracted them with the standard code released by their authors, using default parameters.
Table 1: Features used for the MED'12 system

Low-level visual features: SIFT (Sande, Gevers, & Snoek, 2010); Color SIFT (CSIFT) (Sande, Gevers, & Snoek, 2010); Motion SIFT (MoSIFT) (Chen & Hauptmann, 2009); Transformed Color Histogram (TCH) (Sande, Gevers, & Snoek, 2010); STIP (Willems, Tuytelaars, & Gool, 2008); Dense Trajectory (Wang, Klaser, Schmid, & Liu, 2011)
Low-level audio features: MFCC; Acoustic Unit Descriptors (AUDs) (Chaudhuri, Harvilla, & Raj, 2011)
High-level visual features: Semantic Indexing Concepts (SIN) (Over, et al., 2012); Object Bank (Li, Su, Xing, & Fei-Fei, 2010)
High-level audio features: Acoustic Scene Analysis
Text features: Optical Character Recognition (OCR); Automatic Speech Recognition (ASR)

Besides these common features, we use two home-grown features, Motion SIFT (MoSIFT) and Acoustic Unit Descriptors (AUDs), which we introduce in the following subsections.

1.1.1 Motion SIFT (MoSIFT) Feature

The goal of MoSIFT is to combine features from the spatial and temporal domains. Local spatio-temporal features around interest points provide compact but descriptive representations for video analysis and motion recognition. Existing approaches tend to extend spatial descriptors with a temporal component, which only implicitly captures motion information. MoSIFT detects interest points and encodes not only their local appearance but also explicitly models their local motion; the idea is to detect distinctive local features through both local appearance and motion. Figure 1 illustrates the MoSIFT algorithm.

Figure 1: System flow chart of the MoSIFT algorithm.

The algorithm takes a pair of video frames and finds spatio-temporal interest points at multiple scales. Two major computations are applied: SIFT point detection, and optical flow computation at the scale of the detected SIFT points. For the descriptor, MoSIFT adapts SIFT's grid aggregation to describe motion.
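The descriptor construction can be illustrated with a toy, numpy-only sketch. This is not the authors' implementation: the frame-difference gradients used as the "motion field" are a crude stand-in for real optical flow, and the function names are hypothetical. It only shows the structural idea of aggregating two vector fields (appearance gradients and motion) over the same spatial grid and concatenating the results.

```python
import numpy as np

def grid_histograms(field_x, field_y, grid=4, bins=8):
    """Aggregate orientation histograms over a grid x grid tiling of a patch.

    field_x, field_y: per-pixel x/y components (appearance gradients or a
    motion field) over a square patch. Returns a grid*grid*bins vector.
    """
    h, w = field_x.shape
    mag = np.hypot(field_x, field_y)
    ang = np.arctan2(field_y, field_x)                      # [-pi, pi]
    bin_idx = ((ang + np.pi) / (2 * np.pi) * bins).astype(int) % bins
    hist = np.zeros((grid, grid, bins))
    ys = np.minimum((np.arange(h) * grid) // h, grid - 1)   # row -> tile row
    xs = np.minimum((np.arange(w) * grid) // w, grid - 1)   # col -> tile col
    for i in range(h):
        for j in range(w):
            hist[ys[i], xs[j], bin_idx[i, j]] += mag[i, j]
    return hist.ravel()

def mosift_like_descriptor(patch_t, patch_t1):
    """Toy MoSIFT-style descriptor around one interest point.

    Appearance part: gradient orientation histograms of the current frame.
    Motion part: the same aggregation applied to a crude motion estimate
    (frame-difference gradients, standing in for real optical flow).
    Concatenation gives 128 + 128 = 256 dimensions.
    """
    gy, gx = np.gradient(patch_t.astype(float))
    appearance = grid_histograms(gx, gy)                # 4*4*8 = 128 dims
    diff = patch_t1.astype(float) - patch_t.astype(float)
    fy, fx = np.gradient(diff)                          # crude motion proxy
    motion = grid_histograms(fx, fy)                    # 128 dims
    return np.concatenate([appearance, motion])         # 256 dims

rng = np.random.default_rng(0)
a, b = rng.random((16, 16)), rng.random((16, 16))
d = mosift_like_descriptor(a, b)
print(d.shape)  # (256,)
```

A real implementation would compute optical flow (e.g. Lucas-Kanade) at the scale of each SIFT keypoint, but the aggregation step is the same.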
Optical flow captures the magnitude and direction of movement, and thus has the same form as appearance gradients. The same grid aggregation can therefore be applied to the optical flow in the neighborhood of an interest point, increasing robustness to occlusion and deformation. The two aggregated histograms (appearance and optical flow) are concatenated into the MoSIFT descriptor, which has 256 dimensions.

1.1.2 Acoustic Unit Descriptors (AUDs)

We have developed an unsupervised lexicon-learning algorithm that automatically learns units of sound. Each unit spans a set of audio frames, thereby taking local acoustic context into account. Using maximum-likelihood estimation, we learn a set of such acoustic units from audio data without supervision. Each unit can be thought of as a low-level fundamental unit of sound, and every audio frame is generated by these units. We refer to them as Acoustic Unit Descriptors (AUDs), and we expect their distribution to carry information about the semantic content of the audio stream. Each AUD is represented by a 5-state Hidden Markov Model (HMM) with a 4-Gaussian mixture output density function. Ideally, a perfect learning process would yield semantically interpretable low-level units such as a clap, a thud, or a bang. Naturally, it is hard to enforce semantic interpretability at that level of detail. Further, because the space of possible sounds is so large, many different sounds are mapped to the same unit at learning time, since only a finite set of units can be learned.

1.2 Feature Representations

The previous section briefly described the raw features used in the system; this section describes how those features are represented.
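The most common of these representations, the k-means bag-of-words, can be sketched in a few lines. This is a minimal illustration under the assumption that local descriptors have already been extracted and a codebook has already been trained; it is not the system's actual pipeline, which operates at much larger scale.

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Quantize local descriptors against a k-means codebook and return an
    L1-normalized bag-of-words histogram (one vector per frame or video)."""
    # squared Euclidean distance from every descriptor to every codeword
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)                   # nearest-codeword index
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)

rng = np.random.default_rng(1)
codebook = rng.random((64, 128))                # e.g. 64 visual words
desc = rng.random((500, 128))                   # 500 SIFT-like descriptors
h = bow_histogram(desc, codebook)
print(h.shape)  # (64,)
```

The spatial variants discussed next differ only in computing one such histogram per image tile and concatenating the results.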
Three representations were used in our system: the k-means-based spatial bag-of-words model with standard tiling (Lazebnik, Schmid, & Ponce, 2006), the k-means-based spatial bag-of-words model with feature- and event-specific tiling (Viitaniemil & Laaksonen, 2009), and the Gaussian Mixture Model supervector (Campbell & Sturim, 2006). Since the standard-tiling bag-of-words model and the GMM supervector are standard technology, we focus on the feature- and event-specific tiling, which for simplicity we call tiling.

The spatial bag-of-words model is a widely used representation for low-level image/video features. Its central idea is to divide the image into small tiles, a step also called tiling. Figure 2 shows some tiling examples.

Figure 2: Examples of tiling

In general, the spatial bag-of-words model uses 1x1, 2x2, and 4x4 tilings. However, this choice is ad hoc, and preliminary work has shown that other tilings can perform better (Viitaniemil & Laaksonen, 2009). In our system, we systematically tested 80 different tilings to select the best one for each feature and each event. Table 2 compares feature-specific tiling with standard tiling (for details of the datasets and evaluation metric, see Section 3). For all five features, feature-specific tiling consistently improves over standard tiling by at least 1% (the metric is PMiss, so lower is better).

Table 2: PMiss of feature-specific tiling vs. standard tiling

Feature                   SIFT    CSIFT   TCH     STIP    MoSIFT
Feature-specific tiling   0.4209  0.4496  0.4914  0.5178  0.4330
Standard tiling           0.4325  0.4618  0.5052  0.5234  0.4456

Figure 3 shows the performance of event-specific tiling vs. standard tiling on E025, a difficult event identified in our experiments.
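The tiling idea itself is simple to sketch. The following numpy-only toy builds one histogram per tile from quantized keypoints and concatenates them; the function name and the normalized-coordinate convention are assumptions for illustration, not the paper's code, and the real system additionally searches over 80 candidate tilings per feature and event.

```python
import numpy as np

def spatial_bow(points, words, vocab_size, tiling=(2, 2)):
    """Spatial bag-of-words: split the frame into rows x cols tiles and
    concatenate one codeword histogram per tile.

    points: (N, 2) keypoint (x, y) positions in [0, 1) normalized coords.
    words:  (N,) codeword index of each keypoint.
    Returns a vector of length rows * cols * vocab_size.
    """
    rows, cols = tiling
    tx = np.minimum((points[:, 0] * cols).astype(int), cols - 1)
    ty = np.minimum((points[:, 1] * rows).astype(int), rows - 1)
    hists = np.zeros((rows, cols, vocab_size))
    for x, y, w in zip(tx, ty, words):
        hists[y, x, w] += 1
    return (hists / max(hists.sum(), 1.0)).ravel()

rng = np.random.default_rng(2)
pts = rng.random((300, 2))                    # keypoint locations
wds = rng.integers(0, 32, size=300)           # quantized codewords
v = spatial_bow(pts, wds, vocab_size=32, tiling=(2, 2))
print(v.shape)  # (128,)
```

Selecting a tiling per feature and event then amounts to running this with different `(rows, cols)` layouts and keeping the one with the best cross-validated PMiss.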
It can be seen clearly that event-specific tiling noticeably improves performance over standard tiling.

Figure 3: Comparison of event-specific vs. baseline tiling on event E025 (Marriage_proposal); the y-axis is PMiss@TER=12.5 for the CSIFT, SIFT, MoSIFT, STIP, and TCH features.

1.3 Training and Fusion

We used the standard MED'12 training dataset for our internal evaluation and for training the models in our submission. For the internal evaluation, the MED'12 training set was split by randomly assigning half of the positive examples to a training set and the other half to a testing set. The negative examples consisted only of NULL videos, which have no label information. Two classifiers were used: kernel SVM and kernelized ridge regression (which for simplicity we refer to as kernel regression). For the k-means-based feature representations we used the Chi-square kernel, and for the GMM-based representation the RBF kernel. Model parameters were tuned by 5-fold cross-validation, with PMiss@TER=12.5 as the evaluation metric.

To combine features from multiple modalities and the outputs of different classifiers, we used fusion and ensemble methods. For the same classifier with different features, we used three fusion methods: early fusion, late fusion, and double fusion (Lan, Bao, Yu, Liu, & Hauptmann, 2012). In early fusion, the kernel matrices computed from different features are first normalized and then combined; in late fusion, the prediction scores of the models trained on the individual features are combined. Our system also uses double fusion, which combines early and late fusion. Finally, the results from the different classifiers were ensembled. Figure 4 shows a diagram of our system.
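The three fusion schemes can be summarized in a short sketch. This is a minimal illustration with uniform weights and mean-normalized kernels; the actual system tunes weights and normalization per feature, and the function names here are hypothetical.

```python
import numpy as np

def early_fusion(kernels, weights=None):
    """Early fusion: combine per-feature kernel matrices (normalized here
    by their mean) into a single kernel for one classifier."""
    weights = weights or [1.0 / len(kernels)] * len(kernels)
    return sum(w * (K / K.mean()) for w, K in zip(weights, kernels))

def late_fusion(scores, weights=None):
    """Late fusion: weighted average of per-feature model scores."""
    weights = weights or [1.0 / len(scores)] * len(scores)
    return sum(w * s for w, s in zip(weights, scores))

def double_fusion(per_feature_scores, early_fused_scores):
    """Double fusion: treat the early-fusion run as one more 'feature'
    and late-fuse its scores with the per-feature scores."""
    return late_fusion(per_feature_scores + [early_fused_scores])

# Early fusion of two toy kernel matrices.
K = early_fusion([np.eye(4), 2 * np.eye(4)])

# Double fusion of per-feature scores with the early-fusion scores.
rng = np.random.default_rng(3)
s_sift, s_stip = rng.random(10), rng.random(10)   # per-feature model scores
s_early = rng.random(10)                          # early-fused model scores
final = double_fusion([s_sift, s_stip], s_early)
print(final.shape)  # (10,)
```

The design point is that double fusion needs no new machinery: the early-fused classifier simply contributes one more score stream to the late-fusion average.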